Python for Data Science

AI Bootcamp


PHALLY MAKARA

Course Outline

  1. Introduction to Data Science
  2. Python Ecosystem for Data Science
  3. NumPy Fundamentals (Numerical Computing)
  4. Pandas Fundamentals (Data Manipulation)
  5. Exploratory Data Analysis (EDA)
  6. Data Visualization
    • Matplotlib
    • Seaborn
    • Plotly


Introduction to Data Science

What is Data Science?

Data science is an interdisciplinary field that combines:

  • Scientific methods and processes
  • Statistical analysis and algorithms
  • Machine learning principles
  • Big data technologies

Goal: Extract knowledge and insights, and discover hidden patterns, from structured and unstructured data collected from various sources (web, smartphones, sensors, customers, etc.).

Why Data Science?

Data science or data-driven science enables better decision making, predictive analysis, and pattern discovery:

  • Find the leading cause of a problem by asking the right questions
  • Perform exploratory study on the data
  • Model the data using various algorithms
  • Communicate and visualize the results via graphs, dashboards, etc.

Example 1: Facebook Recommendation

Example 2: Automatic Image Captioning

Example 3: Product Recommendation

Lifecycle of Data Science Project

Categories of Data

In data science and big data, the user may come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:

  • Structured
  • Unstructured
  • Natural language
  • Graph-based
  • Audio, video, and images

Data Types

Python Ecosystem for Data Science

Python Library for Data Science

NumPy Fundamentals

NumPy: Numerical Computing

  • Introduces objects for multidimensional arrays and matrices, along with functions that make it easy to perform advanced mathematical and statistical operations on those objects

  • Provides vectorized mathematical operations on arrays and matrices, which significantly improves performance compared with explicit Python loops

Link: http://www.numpy.org/
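
To make the vectorization point concrete, here is a minimal sketch that computes the same result with a plain Python loop and with a single NumPy expression. The data and the variable names (prices, quantities) are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Illustrative data (hypothetical values)
prices = [2.5, 4.0, 1.25, 10.0]
quantities = [4, 1, 8, 2]

# Pure-Python approach: explicit loop over paired elements
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized approach: one expression applied to whole arrays
prices_arr = np.array(prices)
quantities_arr = np.array(quantities)
totals_vec = prices_arr * quantities_arr

print(totals_loop)          # [10.0, 4.0, 10.0, 20.0]
print(totals_vec.sum())     # 44.0
```

On small inputs the two approaches take similar time; the benefit of the vectorized form shows up as the arrays grow.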

NumPy: NumPy Array

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

NumPy: Creating NumPy Arrays

NumPy arrays are preferred over lists and tuples for their efficiency, especially when working with large datasets.

import numpy as np

# 1D Array
array_1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", array_1d)
print("Shape:", array_1d.shape)

# 2D Array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", array_2d)
print("Shape:", array_2d.shape)

# 3D Array
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("3D Array:\n", array_3d)
print("Shape:", array_3d.shape)

NumPy: Special Functions to Create Arrays

NumPy provides many functions to create arrays:

import numpy as np

a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[ 0.  0.]
                      #          [ 0.  0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[ 1.  1.]]"

c = np.full((2,2), 7)  # Create a constant array (integer fill value gives an integer array)
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[ 1.  0.]
                      #          [ 0.  1.]]"

e = np.random.random((2,2))  # Create an array filled with random values
print(e)                     # Might print "[[ 0.91940167  0.08143941]
                             #               [ 0.68744134  0.87236687]]"
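
Two more creation functions come up often in practice: np.arange and np.linspace. The short sketch below is an addition to the list above, with variable names (f, g, h) chosen only to continue the lettering of the previous example.

```python
import numpy as np

f = np.arange(0, 10, 2)          # Evenly spaced integers: start, stop (exclusive), step
print(f)                         # [0 2 4 6 8]

g = np.linspace(0, 1, 5)         # 5 evenly spaced values from 0 to 1 (both included)
print(g)                         # [0.   0.25 0.5  0.75 1.  ]

h = np.arange(6).reshape(2, 3)   # Values 0..5 arranged as 2 rows x 3 columns
print(h.shape)                   # (2, 3)
```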

NumPy: Array Indexing

Example 1: 1D Array Indexing

import numpy as np

# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])

# Access elements by index
arr[0]      # 10
arr[2]      # 30
arr[-1]     # 50 (last element)

Note

  • Indexing starts at 0
  • A negative index counts from the end

Example 2: 2D Array Indexing

import numpy as np

# Create a 2D NumPy array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Access elements by row and column index
arr_2d[0, 0]    # 1 (first row, first column)
arr_2d[1, 2]    # 6 (second row, third column)
arr_2d[2, 1]    # 8 (third row, second column)
arr_2d[-1, -1]  # 9 (last row, last column)

Example 3: 3D Array Indexing

import numpy as np
# Create a 3D NumPy array
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
# Access elements by depth, row, and column index
arr_3d[0, 0, 0]   # 1 (first depth, first row, first column)
arr_3d[0, 1, 1]   # 4 (first depth, second row, second column)
arr_3d[1, 0, 1]   # 6 (second depth, first row, second column)
arr_3d[1, 1, 0]   # 7 (second depth, second row, first column)

NumPy: Array Slicing

Example 1: 1D Array Slicing

import numpy as np

# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50, 60, 70])

# Slicing: arr[start:stop:step]
arr[1:4]      # [20, 30, 40] (index 1 to 3)
arr[:3]       # [10, 20, 30] (start to index 2)
arr[3:]       # [40, 50, 60, 70] (index 3 to end)
arr[::2]      # [10, 30, 50, 70] (every 2nd element)
arr[::-1]     # [70, 60, 50, 40, 30, 20, 10] (reverse)

Note

  • Slicing syntax: [start:stop:step]
  • stop index is exclusive

Example 2: 2D Array Slicing

import numpy as np

# Create a 2D NumPy array
arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Slicing rows and columns
arr_2d[0:2, 1:3]     # [[2, 3], [6, 7]] (rows 0-1, columns 1-2)
arr_2d[:, 2]         # [3, 7, 11] (all rows, column 2)
arr_2d[1, :]         # [5, 6, 7, 8] (row 1, all columns)
arr_2d[::2, ::2]     # [[1, 3], [9, 11]] (every 2nd row & column)

Example 3: 3D Array Slicing

import numpy as np

# Create a 3D NumPy array
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])

# Slicing across dimensions
arr_3d[0:2, :, :]    # First 2 depths, all rows and columns
arr_3d[:, 0, :]      # All depths, first row, all columns: [[1, 2], [5, 6], [9, 10]]
arr_3d[:, :, 1]      # All depths, all rows, second column: [[2, 4], [6, 8], [10, 12]]
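
One detail worth knowing when slicing: a basic NumPy slice is a view onto the original array, not a copy, so writing to the slice also changes the original. The sketch below illustrates this with a small example array; call .copy() when an independent array is needed.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

view = arr[1:4]          # Basic slice: shares memory with arr
view[0] = 99             # Also modifies arr
print(arr)               # [10 99 30 40 50]

safe = arr[1:4].copy()   # Explicit copy: independent of arr
safe[0] = -1
print(arr)               # [10 99 30 40 50] (not affected by the copy)

print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, safe))   # False
```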

NumPy: Basic Array Attributes

import numpy as np
numpy_ex = np.array([[1, 2, 3], [4, 5, 6]])

Attribute  Description               Example         Result
---------  ------------------------  --------------  -------------------
ndim       Number of dimensions      numpy_ex.ndim   2
shape      Size in each dimension    numpy_ex.shape  (2, 3)
size       Total number of elements  numpy_ex.size   6
dtype      Data type of elements     numpy_ex.dtype  int64
T          Transpose the array       numpy_ex.T      [[1,4],[2,5],[3,6]]

NumPy: Array Mathematics

import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

Operation  Description                  Example          Result
---------  ---------------------------  ---------------  ----------------------
+          Element-wise addition        a + b            [6, 8, 10, 12]
-          Element-wise subtraction     a - b            [-4, -4, -4, -4]
*          Element-wise multiplication  a * b            [5, 12, 21, 32]
/          Element-wise division        a / b            [0.2, 0.33, 0.43, 0.5]
**         Element-wise power           a ** 2           [1, 4, 9, 16]
sum()      Sum of all elements          np.sum(a)        10
mean()     Average value                np.mean(a)       2.5
min()      Minimum value                np.min(a)        1
max()      Maximum value                np.max(a)        4
dot()      Dot product                  a.dot(b)         70
reshape()  Change array shape           a.reshape(2, 2)  [[1, 2], [3, 4]]
Pandas Fundamentals

Pandas: Panel Data

  • Adds data structures and tools designed to work with table-like data (similar to data frames in R)

  • Provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.

  • Allows handling missing data

Pandas: Series

A Series is like a NumPy array but with labels. It is strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc.), including a mix of them.

Series can be created from a scalar, a list, ndarray or dictionary using pd.Series() (note the capital “S”).

Pandas: Creating Series

Example 1: From a List

import pandas as pd

# Create Series from a list
s = pd.Series([10, 20, 30, 40, 50])
print(s)
# Output:
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64

Example 2: From a Dictionary

# Create Series from a dictionary (keys become labels)
data = {'a': 100, 'b': 200, 'c': 300}
s = pd.Series(data)
print(s)
# Output:
# a    100
# b    200
# c    300
# dtype: int64
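
The remaining two construction routes mentioned earlier, a scalar and an ndarray, can be sketched the same way. The index labels below are arbitrary examples.

```python
import numpy as np
import pandas as pd

# From a scalar: the value is repeated for every index label
s_scalar = pd.Series(7, index=["x", "y", "z"])

# From an ndarray, with custom labels
s_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["a", "b", "c"])

print(s_scalar["y"])      # 7
print(s_array["b"])       # 2.5
print(s_array.dtype)      # float64
```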

Pandas: DataFrame

Pandas DataFrames are your new best friend. They are like the Excel spreadsheets you may be used to.

DataFrames are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data.
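
The "dictionary of Series" picture can be checked directly. This is a minimal sketch with two small illustrative Series; df_demo is a name chosen here to avoid clashing with other examples.

```python
import pandas as pd

names = pd.Series(["Shang", "Yuttey", "Sakada"])
courses = pd.Series([5, 4, 7])

# Keys become column labels, each Series becomes a column
df_demo = pd.DataFrame({"Name": names, "Courses": courses})

print(df_demo.shape)                               # (3, 2)
print(list(df_demo.columns))                       # ['Name', 'Courses']
print(isinstance(df_demo["Courses"], pd.Series))   # True: a column is a Series
```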

Pandas: Creating DataFrame

Example 1: Basic DataFrame

import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(df)

Example 2: DataFrame with Custom Labels

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  index = ["R1", "R2", "R3"],
                  columns = ["C1", "C2", "C3"])
print(df)

Pandas: Indexing and Slicing DataFrames

There are several main ways to select data from a DataFrame:

import pandas as pd

df = pd.DataFrame({"Name": ["Shang", "Yuttey", "Sakada"],
                   "Language": ["Python", "Python", "R"],
                   "Courses": [5, 4, 7]})

Method    Description                Example                           Output
--------  -------------------------  --------------------------------  --------------------------------
[]        Select column(s)           df["Name"]                        ["Shang", "Yuttey", "Sakada"]
.loc[]    Label-based indexing       df.loc[0, "Name"]                 "Shang"
                                     df.loc[0:1, ["Name", "Courses"]]  Rows 0-1, Name & Courses columns
.iloc[]   Integer position-based     df.iloc[0, 0]                     "Shang"
                                     df.iloc[0:2, 0:2]                 First 2 rows, first 2 columns
Boolean   Condition-based filtering  df[df["Courses"] > 5]             Rows where Courses > 5
.query()  SQL-like string query      df.query("Language == 'Python'")  Rows with Python language

Pandas: Indexing Cheatsheet

Method                                Syntax                                  Output
------------------------------------  --------------------------------------  ---------------------------------------------------------------------------
Select column                         df[col_label]                           Series
Select row slice                      df[row_1_int:row_2_int]                 DataFrame
Select row/column by label            df.loc[row_label(s), col_label(s)]      Single value for one cell, Series for one row/column, otherwise DataFrame
Select row/column by integer          df.iloc[row_int(s), col_int(s)]         Single value for one cell, Series for one row/column, otherwise DataFrame
Select by row integer & column label  df.loc[df.index[row_int], col_label]    Single value for one cell, Series for one row/column, otherwise DataFrame
Select by row label & column integer  df.loc[row_label, df.columns[col_int]]  Single value for one cell, Series for one row/column, otherwise DataFrame
Select by boolean                     df[bool_vec]                            DataFrame
Select by boolean expression          df.query("expression")                  DataFrame
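
The return types in the cheatsheet can be verified on a small frame. The sketch below rebuilds the three-row example DataFrame from the previous slide under the name df_demo (an illustrative name) and prints the type each selection produces.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Shang", "Yuttey", "Sakada"],
                        "Language": ["Python", "Python", "R"],
                        "Courses": [5, 4, 7]})

print(type(df_demo["Name"]).__name__)                   # Series
print(type(df_demo[0:2]).__name__)                      # DataFrame
print(df_demo.loc[0, "Name"])                           # Shang (single value)
print(type(df_demo.iloc[0]).__name__)                   # Series (one row)
print(type(df_demo[df_demo["Courses"] > 4]).__name__)   # DataFrame
print(type(df_demo.query("Courses > 4")).__name__)      # DataFrame
```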

Pandas: Reading/Writing Data From External Sources

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.

Common Methods:

  • pd.read_csv() - Load CSV file

  • pd.read_excel() - Load Excel file
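
Writing works the same way in the other direction through the DataFrame.to_csv() method. The sketch below round-trips a tiny illustrative frame through an in-memory buffer so that it runs without touching the disk; with a real file you would pass a path of your own instead of the buffer.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Shang", "Yuttey"], "Courses": [5, 4]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)    # index=False leaves out the row labels

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(df_back.shape)                   # (2, 2)
print(df_back["Courses"].tolist())     # [5, 4]
```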

Pandas: Titanic Dataset

To load your data from various file types such as CSV, Excel, or JSON, first define your file path:

import pandas as pd
import numpy as np

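# Define file path
file_path = "https://raw.githubusercontent.com/MorkMongkul/AI-Bootcamp-Instinct/main/Data/Titanic-Dataset.csv"
# Load CSV file (the Titanic file above is a CSV)
df = pd.read_csv(file_path)

# Other formats use the matching reader with a file of that type.
# The paths below are placeholders, not files from the course:
# df = pd.read_excel("your_file.xlsx")
# df = pd.read_json("your_file.json")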

Note

  • pd.read_csv() - Load CSV file
  • pd.read_excel() - Load Excel file
  • pd.read_json() - Load JSON file

Pandas: Data Inspection Methods

After loading your data, use these methods to inspect and understand your dataset:

Method         Description                               Example
-------------  ----------------------------------------  ----------------------------------------------
df.head(n)     View first n rows (default: 5)            df.head(20) → First 20 rows
df.tail(n)     View last n rows (default: 5)             df.tail(10) → Last 10 rows
df.shape       Get dimensions (rows, columns)            df.shape → (891, 12)
df.columns     Get all column names                      df.columns → Index of column names
df.info()      Get column info, dtypes, non-null counts  df.info() → Full data overview
df.describe()  Get statistical summary                   df.describe() → Mean, std, min, max, quartiles

Recommendation

Run df.info() immediately after loading your data. It provides a comprehensive view of column names, data types, and missing values - giving you a quick understanding of your data and any issues to handle.

Pandas: Selecting Rows & Columns

Use the Titanic df loaded earlier to select rows and columns in different ways:

1) Select column(s) by name

df["Age"]
df[["Age", "Fare", "Survived"]]

2) Select rows by index (iloc – Integer Location)

# Single row
df.iloc[0]

# Multiple rows
df.iloc[0:5]

# Rows & columns together
df.iloc[0:5, 1:4]

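3) Select rows by label (loc – Label Location)

# Rows with index labels 0 through 4 (the end label is inclusive), two columns
df.loc[0:4, ["Name", "Age"]]

# loc also accepts a boolean condition
df.loc[df["Sex"] == "female"]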

4) Boolean indexing

df[df["Survived"] == 1]

# Multiple conditions
df[(df["Sex"] == "female") & (df["Age"] < 30)]

5) Query syntax

df.query("Age > 30 and Sex == 'male'")

Tip

When combining multiple conditions, wrap each condition in parentheses: (cond1) & (cond2) or (cond1) | (cond2).
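
The parentheses matter because & and | bind more tightly than comparison operators such as == and <. The sketch below shows AND, OR, and NOT on a small hypothetical sample that borrows Titanic-style column names; the values are made up for illustration.

```python
import pandas as pd

# Small hypothetical sample with Titanic-style column names
people = pd.DataFrame({"Sex": ["female", "male", "female", "male"],
                       "Age": [22, 35, 41, 8],
                       "Survived": [1, 0, 1, 1]})

young_women = people[(people["Sex"] == "female") & (people["Age"] < 30)]   # AND
child_or_older = people[(people["Age"] < 12) | (people["Age"] > 40)]       # OR
not_survived = people[~(people["Survived"] == 1)]                          # NOT

print(len(young_women), len(child_or_older), len(not_survived))   # 1 2 1
```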

Pandas: Data Selection & Filtering — Cheat Sheet

Method            Used For
----------------  ------------------------
df["col"]         Select one column
df[["c1","c2"]]   Select multiple columns
iloc              Position-based selection
loc               Label & condition-based
Boolean indexing  Filtering rows
query()           Readable conditions
head() / tail()   Inspection
sample()          Random sampling
isna()            Missing value filtering
select_dtypes()   Feature selection
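
The last three entries of the cheat sheet are not demonstrated elsewhere in these slides, so here is a brief sketch of each. The frame below is a hypothetical sample with made-up values.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                       "Age": [22.0, np.nan, 41.0, 8.0],
                       "Fare": [7.25, 71.28, 8.05, np.nan]})

random_rows = people.sample(n=2, random_state=0)         # 2 random rows, reproducible
missing_age = people[people["Age"].isna()]               # rows where Age is missing
numeric_only = people.select_dtypes(include="number")    # numeric columns only

print(len(random_rows))                 # 2
print(missing_age["Name"].tolist())     # ['B']
print(list(numeric_only.columns))       # ['Age', 'Fare']
```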

Pandas: Mathematical & Statistical Operations

Use Titanic df to compute descriptive statistics and aggregations.

Descriptive Statistics

# Single column statistics
df["Age"].median()     # Middle value
df["Age"].min()        # Minimum age
df["Age"].max()        # Maximum age
df["Age"].sum()        # Total sum

# Summary statistics for all numeric columns
df.describe()

# Mean of selected columns
df[["Age", "Fare"]].mean()

GroupBy Aggregations

# Survival rate by gender
df.groupby("Sex")["Survived"].mean()

Note

groupby() is powerful for aggregations like mean(), sum(), count(), min(), max().
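
Several aggregations can be computed in one pass with .agg() and named outputs. This is a minimal sketch on a hypothetical four-row sample; the output column names (survival_rate, mean_fare, passengers) are chosen here for illustration.

```python
import pandas as pd

people = pd.DataFrame({"Sex": ["female", "male", "female", "male"],
                       "Survived": [1, 0, 1, 1],
                       "Fare": [80.0, 10.0, 20.0, 30.0]})

# Each keyword is an output column: (input column, aggregation)
summary = people.groupby("Sex").agg(
    survival_rate=("Survived", "mean"),
    mean_fare=("Fare", "mean"),
    passengers=("Survived", "count"),
)
print(summary)
```

The same pattern applies to the Titanic df once it is loaded.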

Pandas: Handling Missing Values

Detect, remove, or impute missing data to prepare for analysis.

# Detect missing values per column
df.isna().sum()

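# Drop rows with any missing values (use sparingly); returns a new DataFrame
df.dropna()

# Fill/impute missing values with median (common for numeric)
# Assign the result back: fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df["Age"] = df["Age"].fillna(df["Age"].median())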

Best Practice

Prefer imputation over dropping rows to preserve data. Choose appropriate statistics (median for skewed data, mean for normal distributions).
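
To see why the median is the safer choice for skewed data, consider this small sketch. The values are hypothetical: four ordinary entries, one extreme value, and one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with one outlier and one missing value
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0, np.nan])

print(fares.mean())      # 106.8 (pulled up by the outlier)
print(fares.median())    # 9.0

filled = fares.fillna(fares.median())   # returns a new Series
print(filled.isna().sum())              # 0
print(filled.iloc[-1])                  # 9.0
```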

Pandas: Data Type Handling

Convert data types and encode categorical variables.

# Change data type to category (saves memory, enables categorical operations)
df["Survived"] = df["Survived"].astype("category")
df["Survived"].dtype

# Map categorical strings to numeric values (simple encoding)
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

For Machine Learning

Use pd.get_dummies() for one-hot encoding. In scikit-learn pipelines, sklearn.preprocessing.OneHotEncoder or OrdinalEncoder handle categorical features; LabelEncoder is intended for encoding the target labels.
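
A minimal sketch of one-hot encoding with pd.get_dummies() follows. It uses a hypothetical four-row sample with an Embarked column like the one in the Titanic data; dtype=int is passed so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

people = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"],
                       "Age": [22, 35, 41, 8]})

# One new 0/1 column per category; the original column is replaced
encoded = pd.get_dummies(people, columns=["Embarked"], dtype=int)

print(list(encoded.columns))            # ['Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
print(encoded["Embarked_S"].tolist())   # [1, 0, 0, 1]
```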

Pandas: Handling Duplicates

Identify and remove duplicate rows from your dataset.

# Check for duplicate rows (returns boolean Series)
df.duplicated()

# Remove duplicate rows (keeps first occurrence by default)
df.drop_duplicates()

Advanced Options

  • Use subset=["col1", "col2"] to check duplicates based on specific columns
  • Use keep="last" to keep the last occurrence instead of the first, or keep=False to drop every row that has a duplicate
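
The effect of subset and keep is easiest to see on a tiny frame. The data below is hypothetical; rows 0 and 1 share the same Name and Ticket but differ in Fare.

```python
import pandas as pd

tickets = pd.DataFrame({"Name": ["Ann", "Ann", "Bob", "Ann"],
                        "Ticket": ["T1", "T1", "T2", "T3"],
                        "Fare": [10, 12, 20, 10]})

print(tickets.duplicated().sum())                            # 0 (no fully identical rows)
print(tickets.duplicated(subset=["Name", "Ticket"]).sum())   # 1

first = tickets.drop_duplicates(subset=["Name", "Ticket"])               # keeps row 0
last = tickets.drop_duplicates(subset=["Name", "Ticket"], keep="last")   # keeps row 1
none = tickets.drop_duplicates(subset=["Name", "Ticket"], keep=False)    # drops rows 0 and 1

print(first.index.tolist())   # [0, 2, 3]
print(last.index.tolist())    # [1, 2, 3]
print(none.index.tolist())    # [2, 3]
```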

Matplotlib/Seaborn Fundamentals

Matplotlib: Python 2D Plotting Library

  • Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats

  • A set of functionalities similar to those of MATLAB

  • Line plots, scatter plots, bar charts, histograms, pie charts, etc.

  • Relatively low-level; some effort needed to create advanced visualization

Link: https://matplotlib.org/

Matplotlib: Installation and Import

Installation:

!pip install matplotlib

Import with Alias:

import matplotlib.pyplot as plt

Matplotlib: Key Components

Component  Code Example
---------  -------------------------------------------
Figure     plt.figure(figsize=(width, height))
Plot       plt.plot(x, y, marker='o', label="label1")
           plt.plot(x, y2, marker='s', label="label2")
Labels     plt.xlabel("X-axis Label")
           plt.ylabel("Y-axis Label")
           plt.title("Plot Title")
Ticks      plt.xticks([x_values])
           plt.yticks([y_values])
Legend     plt.legend()
Gridlines  plt.grid(True)
Display    plt.show()

Example:

import matplotlib.pyplot as plt

# Sample data
exams = [1, 2, 3, 4, 5]
math_scores = [60, 65, 70, 78, 85]
science_scores = [58, 63, 68, 75, 80]

plt.figure(figsize=(7, 5))

plt.plot(exams, math_scores, marker='o', label="Math")
plt.plot(exams, science_scores, marker='s', label="Science")

plt.xlabel("Exam Number")
plt.ylabel("Score")
plt.title("Student Exam Scores")

plt.xticks(exams)
plt.yticks([50, 60, 70, 80, 90])

plt.legend()
plt.grid(True)
plt.show()

Matplotlib: Figure and Axes Objects

Component      Code Example
-------------  ------------------------------------------
Figure & Axes  fig, ax = plt.subplots()
Plot           ax.plot(x, y, label='label1', marker='o')
               ax.plot(x, y2, label='label2', marker='s')
Labels         ax.set_xlabel('X-axis Label')
               ax.set_ylabel('Y-axis Label')
               ax.set_title('Plot Title')
Ticks          ax.set_xticks([x_values])
               ax.set_yticks([y_values])
Legend         ax.legend()
Gridlines      ax.grid(True, linestyle='--', alpha=0.7)
Display        plt.show()

Example:

import matplotlib.pyplot as plt

# Sample data
exams = [1, 2, 3, 4, 5]
math_scores = [60, 65, 70, 78, 85]
science_scores = [58, 63, 68, 75, 80]

# Create figure and axes
fig, ax = plt.subplots()

# Plot lines
ax.plot(exams, math_scores, label='Math Scores', marker='o')
ax.plot(exams, science_scores, label='Science Scores', marker='s')

# Add labels
ax.set_xlabel('Exam Number')           # X-axis label
ax.set_ylabel('Scores')                # Y-axis label
ax.set_title('Student Performance')    # Title

# Customize ticks
ax.set_xticks(exams)                   # X-tick positions
ax.set_yticks([50, 60, 70, 80, 90])    # Y-tick positions

# Add legend and gridlines
ax.legend()                            # Legend
ax.grid(True, linestyle='--', alpha=0.7)  # Gridlines

plt.show()

Matplotlib: Types of Plots Overview

Different plot types are suited for different data types and analysis goals:

Plot Type     Best For                          Data Type
------------  --------------------------------  ---------------------------------
Line Plot     Trends over time/continuous data  Time series, continuous variables
Scatter Plot  Relationships between variables   Two continuous variables
Bar Chart     Comparing categories              Categorical vs numerical
Histogram     Distribution of single variable   Single continuous variable
Pie Chart     Parts of a whole (proportions)    Categorical data (percentages)
Box Plot      Distribution & outliers           Continuous data across categories
Heatmap       Correlation or matrix data        2D array/matrix data

Matplotlib: Line Plot

Use Case: Show trends over time or continuous relationships

When to Use: Time series data, tracking changes, showing trends

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 
          'Apr', 'May', 'Jun']
sales = [15000, 18000, 16500, 
         21000, 23500, 25000]

plt.figure(figsize=(8, 5))
plt.plot(months, sales, 
         marker='o', 
         linewidth=2, 
         color='blue', 
         label='Sales')
plt.xlabel('Month')
plt.ylabel('Sales ($)')
plt.title('Monthly Sales Trend')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

Matplotlib: Scatter Plot

Use Case: Examine relationships between two continuous variables

When to Use: Correlation analysis, finding patterns, identifying clusters

import matplotlib.pyplot as plt
import numpy as np

# Sample data: study hours vs exam scores
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [50, 55, 60, 65, 75, 80, 85, 90]

plt.figure(figsize=(8, 5))
plt.scatter(study_hours, exam_scores, 
            s=100, color='green', 
            alpha=0.6, edgecolors='black')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.title('Study Hours vs Exam Score')
plt.grid(True, alpha=0.3)
plt.show()

Matplotlib: Bar Chart

Use Case: Compare values across different categories

When to Use: Category comparisons, survey results, rankings

import matplotlib.pyplot as plt

# Sample data: product sales by category
categories = ['Electronics', 'Clothing', 
              'Food', 'Books', 'Toys']
sales = [45000, 32000, 28000, 18000, 15000]

plt.figure(figsize=(8, 5))
plt.bar(categories, sales, 
        color=['#FF6B6B', '#4ECDC4', 
               '#45B7D1', '#FFA07A', '#98D8C8'])
plt.xlabel('Product Category')
plt.ylabel('Sales ($)')
plt.title('Sales by Product Category')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Matplotlib: Histogram

Use Case: Show distribution and frequency of a single variable

When to Use: Understanding data distribution, identifying skewness, finding ranges

import matplotlib.pyplot as plt
import numpy as np

# Sample data: student ages in a class
ages = np.random.normal(20, 2, 100)

plt.figure(figsize=(8, 5))
plt.hist(ages, bins=15, color='purple', 
         alpha=0.7, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Student Ages')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Matplotlib: Pie Chart

Use Case: Show proportions and percentages of a whole

When to Use: Market share, budget allocation, composition analysis

import matplotlib.pyplot as plt

# Sample data: budget allocation
categories = ['Marketing', 'R&D', 
              'Operations', 'HR', 'IT']
budget = [30, 25, 20, 15, 10]
colors = ['#FF9999', '#66B2FF', '#99FF99', 
          '#FFCC99', '#FF99CC']

plt.figure(figsize=(8, 6))
plt.pie(budget, labels=categories, 
        autopct='%1.1f%%', 
        startangle=90, colors=colors)
plt.title('Department Budget Allocation')
plt.axis('equal')
plt.show()

Matplotlib: Box Plot

Use Case: Show distribution, median, quartiles, and outliers

When to Use: Comparing distributions, identifying outliers, statistical summary

import matplotlib.pyplot as plt
import numpy as np

# Sample data: test scores
class_a = np.random.normal(75, 10, 50)
class_b = np.random.normal(80, 8, 50)
class_c = np.random.normal(70, 12, 50)

data = [class_a, class_b, class_c]

plt.figure(figsize=(8, 5))
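plt.boxplot(data, patch_artist=True)
# Label the boxes through xticks; the boxes sit at positions 1, 2, 3.
# (The boxplot `labels` argument was renamed `tick_labels` in Matplotlib 3.9.)
plt.xticks([1, 2, 3], 
           ['Class A', 'Class B', 'Class C'])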
plt.ylabel('Test Scores')
plt.title('Test Score Distribution by Class')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Matplotlib: Heatmap

Use Case: Visualize matrix data, correlations, or intensity values

When to Use: Correlation matrices, confusion matrices, time-based patterns

import matplotlib.pyplot as plt
import numpy as np

# Sample data: random values standing in for a correlation matrix
data = np.random.rand(5, 5)
labels = ['Math', 'Science', 'English', 
          'History', 'Art']

plt.figure(figsize=(8, 6))
plt.imshow(data, cmap='YlOrRd', 
           aspect='auto')
plt.colorbar(label='Correlation')
plt.xticks(range(5), labels, rotation=45)
plt.yticks(range(5), labels)
plt.title('Subject Correlation Heatmap')
plt.tight_layout()
plt.show()
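
The example above fills the matrix with np.random.rand, so its values are not real correlations. As a hedged sketch of how an actual correlation matrix could be produced for such a heatmap, the block below builds synthetic scores (the subject names and the way the columns are generated are assumptions made for illustration) and computes Pearson correlations with DataFrame.corr().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
math_scores = rng.normal(70, 10, 200)

scores = pd.DataFrame({
    "Math": math_scores,
    "Science": math_scores + rng.normal(0, 5, 200),   # constructed to correlate with Math
    "Art": rng.normal(70, 10, 200),                   # generated independently of Math
})

# Pearson correlation: symmetric, values in [-1, 1], ones on the diagonal
corr = scores.corr()
print(corr.round(2))
```

The resulting corr.values array can be passed to plt.imshow() in place of the random data, ideally with vmin=-1 and vmax=1 so the color scale covers the full correlation range.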

Matplotlib: Multiple Subplots

Use Case: Display multiple plots side by side for comparison

import matplotlib.pyplot as plt
import numpy as np

# Sample data
x = np.linspace(0, 10, 100)

fig, axes = plt.subplots(2, 2, 
                         figsize=(10, 8))

# Plot 1: Line
axes[0, 0].plot(x, np.sin(x), 'b-')
axes[0, 0].set_title('Sine Wave')

# Plot 2: Scatter
axes[0, 1].scatter(x, np.cos(x), 
                   c='red', alpha=0.5)
axes[0, 1].set_title('Cosine Scatter')

# Plot 3: Bar
axes[1, 0].bar(['A', 'B', 'C'], [3, 7, 5])
axes[1, 0].set_title('Bar Chart')

# Plot 4: Histogram
axes[1, 1].hist(np.random.randn(1000), 
                bins=30, color='green', 
                alpha=0.7)
axes[1, 1].set_title('Histogram')

plt.tight_layout()
plt.show()

Seaborn Fundamentals

Seaborn: Statistical Data Visualization

  • Built on top of Matplotlib with a high-level interface for drawing attractive statistical graphics

  • Provides beautiful default styles and color palettes

  • Designed to work seamlessly with pandas DataFrames

  • Specialized for statistical visualizations with less code

Link: https://seaborn.pydata.org/

Seaborn: Installation and Import

Installation:

!pip install seaborn

Import with Alias:

import seaborn as sns
import matplotlib.pyplot as plt  # Often used together
import pandas as pd

Note

Seaborn is built on Matplotlib, so you’ll often use both libraries together. Seaborn for high-level plotting and Matplotlib for fine-tuning.

Seaborn vs Matplotlib

Feature             Matplotlib                     Seaborn
------------------  -----------------------------  -------------------------------
Level               Low-level, more control        High-level, simpler syntax
Default Style       Basic, requires customization  Beautiful out-of-the-box
Statistical Plots   Requires manual calculation    Built-in statistical functions
Pandas Integration  Manual data preparation        Direct DataFrame support
Code Length         More verbose                   More concise
Use Case            Full customization needed      Quick statistical visualization

Best Practice

Use Seaborn for initial exploration and statistical plots, then switch to Matplotlib when you need fine-grained control.

Seaborn: Plot Categories

Seaborn organizes plots into categories based on their purpose:

Category        Purpose                          Key Functions
--------------  -------------------------------  ---------------------------------
Relational      Relationships between variables  scatterplot(), lineplot()
Distributional  Distribution of variables        histplot(), kdeplot(), ecdfplot()
Categorical     Categorical comparisons          barplot(), countplot(), boxplot()
Regression      Statistical relationships        regplot(), lmplot()
Matrix          Matrix data visualization        heatmap(), clustermap()

Seaborn: Setting Styles

Seaborn provides built-in themes to quickly change plot appearance:

import seaborn as sns
import matplotlib.pyplot as plt

# Available styles:
# 'darkgrid', 'whitegrid', 'dark', 
# 'white', 'ticks'

# Set style
sns.set_style("whitegrid")

# Set context for scaling
# 'paper', 'notebook', 'talk', 'poster'
sns.set_context("talk")

# Set color palette
sns.set_palette("husl")

Style Examples:

  • darkgrid: Dark background with grid
  • whitegrid: White background with grid
  • dark: Dark background, no grid
  • white: White background, no grid
  • ticks: White with ticks on axes

Context Examples:

  • paper: Smallest (for papers)
  • notebook: Default size
  • talk: Larger (for presentations)
  • poster: Largest (for posters)

Seaborn: Color Palettes

import seaborn as sns
import matplotlib.pyplot as plt

# Qualitative palettes (categorical)
sns.color_palette("Set2")
sns.color_palette("Paired")

# Sequential palettes (continuous)
sns.color_palette("Blues")
sns.color_palette("rocket")

# Diverging palettes (two extremes)
sns.color_palette("coolwarm")
sns.color_palette("vlag")

# Custom palette
custom = ["#FF6B6B", "#4ECDC4", "#45B7D1"]
sns.set_palette(custom)

Seaborn: Distribution Plot (histplot)

Use Case: Visualize distribution of a single variable with histogram

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000)

# Create distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(data, kde=True, 
             color='skyblue', bins=30)
plt.title('Distribution Plot with KDE')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

KDE (Kernel Density Estimate)

Shows smooth probability density curve overlaid on histogram.

Seaborn: Box Plot

Use Case: Compare distributions across categories, identify outliers

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np  # needed for np.concatenate and np.random below

# Sample data
df = pd.DataFrame({
    'Category': ['A']*50 + ['B']*50 + ['C']*50,
    'Values': np.concatenate([
        np.random.normal(20, 5, 50),
        np.random.normal(30, 7, 50),
        np.random.normal(25, 4, 50)
    ])
})

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Category', 
            y='Values', palette='Set2')
plt.title('Box Plot Comparison')
plt.show()

Seaborn: Violin Plot

Use Case: Show distribution shape with more detail than box plot

import seaborn as sns
import matplotlib.pyplot as plt

# Using same data from previous slide
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Category', 
               y='Values', palette='muted')
plt.title('Violin Plot - Distribution Shape')
plt.show()

Violin vs Box Plot

Violin plots show the full distribution shape (density), while box plots show quartiles and outliers.
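
The quartiles and outliers that a box plot draws can also be computed by hand. The sketch below uses the common 1.5 × IQR rule, which is the default whisker setting in Matplotlib and seaborn box plots; the sample values are made up for illustration.

```python
import numpy as np

values = np.array([12, 15, 17, 18, 19, 20, 21, 22, 24, 60])

# Quartiles (NumPy's default linear interpolation)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)              # 17.25 19.5 21.75
print(lower_fence, upper_fence)    # 10.5 28.5
print(outliers)                    # [60]
```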

Seaborn: Bar Plot with Error Bars

Use Case: Compare means across categories with confidence intervals

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'D', 'E'],
    'Mean': [23, 45, 32, 38, 41],
    'StdDev': [3, 5, 4, 3, 6]
})

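plt.figure(figsize=(10, 6))
# The data is already aggregated (one row per category), so seaborn cannot
# estimate the spread itself; draw the StdDev column as error bars explicitly.
ax = sns.barplot(data=df, x='Category', 
                 y='Mean', palette='viridis',
                 errorbar=None)
ax.errorbar(x=df['Category'], y=df['Mean'], 
            yerr=df['StdDev'], fmt='none',
            ecolor='black', capsize=5)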
plt.title('Bar Plot with Error Bars')
plt.ylabel('Average Value')
plt.show()

Seaborn: Count Plot

Use Case: Show frequency of categorical variables

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Category': ['A']*45 + ['B']*32 + 
                ['C']*28 + ['D']*15 + ['E']*23
})

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Category', 
              palette='pastel')
plt.title('Count Plot - Frequency Distribution')
plt.ylabel('Count')
plt.show()

Seaborn: Scatter Plot with Regression

Use Case: Show relationship between two variables with trend line

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data
np.random.seed(42)
x = np.random.rand(100) * 100
y = 2 * x + np.random.randn(100) * 15

plt.figure(figsize=(10, 6))
sns.regplot(x=x, y=y, 
            scatter_kws={'alpha':0.5},
            line_kws={'color':'red'})
plt.title('Scatter Plot with Regression Line')
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.show()

Seaborn: Heatmap with Annotations

Use Case: Visualize correlation matrix or 2D data

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data: random values standing in for a correlation matrix
np.random.seed(42)
data = np.random.rand(5, 5)
labels = ['A', 'B', 'C', 'D', 'E']

plt.figure(figsize=(10, 8))
sns.heatmap(data, annot=True, fmt='.2f',
            cmap='coolwarm', 
            xticklabels=labels,
            yticklabels=labels,
            cbar_kws={'label': 'Correlation'})
plt.title('Correlation Heatmap')
plt.show()

Seaborn: Pair Plot

Use Case: Explore relationships between all variable pairs

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris(as_frame=True)
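df = iris.frame
df['species'] = df['target'].map({
    0: 'setosa', 
    1: 'versicolor', 
    2: 'virginica'
})
# Drop the numeric target so it is not plotted as if it were a feature
df = df.drop(columns='target')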

# Create pair plot
sns.pairplot(df, hue='species', 
             palette='Set1')
plt.show()

Pair Plot

Shows scatter plots for all variable combinations and distributions on diagonal.

Seaborn: FacetGrid (Multiple Subplots)

Use Case: Create multiple plots based on categorical variables

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Sample data
np.random.seed(42)
df = pd.DataFrame({
    'x': np.random.rand(300) * 100,
    'y': np.random.rand(300) * 100,
    'category': np.repeat(['A', 'B', 'C'], 100)
})

# Create FacetGrid
g = sns.FacetGrid(df, col='category', 
                  height=4)
g.map(sns.scatterplot, 'x', 'y')
g.add_legend()
plt.show()

Seaborn: Joint Plot

Use Case: Combine scatter plot with marginal distributions

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data
np.random.seed(42)
x = np.random.randn(500)
y = x + np.random.randn(500) * 0.5

# Create joint plot
sns.jointplot(x=x, y=y, kind='scatter',
              marginal_kws={'bins': 30,
                           'color': 'skyblue'},
              joint_kws={'alpha': 0.5})
plt.show()

Joint Plot Types

  • scatter: Scatter plot (default)
  • hex: Hexbin plot
  • kde: KDE contours
  • reg: With regression line

Seaborn: Best Practices Summary

Practice          Recommendation
----------------  --------------------------------------------------------------------------
Style             Set style once at beginning: sns.set_style("whitegrid")
Context           Use sns.set_context("talk") for presentations
Color Palette     Choose appropriate palette for data type (categorical/sequential/diverging)
Figure Size       Set before plotting: plt.figure(figsize=(10, 6))
Data Format       Use pandas DataFrame for easy integration
Annotations       Add annot=True to heatmaps for values
Statistical Info  Use the errorbar parameter in barplot for confidence intervals, e.g. errorbar=("ci", 95); it replaces the ci parameter deprecated in seaborn 0.12
Combining Plots   Use FacetGrid or pairplot for multi-dimensional data
Customization     Combine with Matplotlib for fine-tuning
Documentation     Check seaborn.pydata.org for examples

Questions?

ICT center

Phally Makara